GPT-2 Feature Interpretability Dashboard

Exploring 15,399 interpretable features from 72 TopK SAEs

Summary statistics:

Interpretability Rate: 41.8% (15,399 / 36,864 features)

Monosemantic Rate: 72.6% (11,187 / 15,399 interpretable features)

Total Correlations: 49,748 (avg 3.2 per feature)

Number Features: 66.4% (the dominant feature type)
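The headline rates above follow directly from the raw counts. A minimal sketch recomputing them (the per-SAE width of 512 is an inference from 36,864 / 72 and is an assumption, not stated by the dashboard):

```python
# Recompute the dashboard's headline stats from its raw counts.
total_features = 36_864   # 72 SAEs; 36_864 / 72 = 512 features each (assumption)
interpretable = 15_399
monosemantic = 11_187
correlations = 49_748

interp_rate = interpretable / total_features  # fraction of all features interpretable
mono_rate = monosemantic / interpretable      # fraction of interpretable that are monosemantic
avg_corr = correlations / interpretable       # correlations per interpretable feature

print(f"{interp_rate:.1%}, {mono_rate:.1%}, {avg_corr:.1f}")
```

This reproduces the displayed 41.8%, 72.6%, and 3.2 figures, confirming the denominators: the monosemantic rate is taken over interpretable features, not all features.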
[Interactive feature table omitted; columns: Feature ID, Layer, Position, Label, Mono, Top Enrichment, Top Token Correlations, Details]

Model Comparison: GPT-2 vs LLaMA

Key Finding: Model Capacity vs Interpretability

GPT-2 (124M): 72.6% monosemantic rate, 29.3% sparsity

LLaMA (1B): ~50% estimated monosemantic rate, 19.5% sparsity

This suggests that smaller models rely on more specialized, monosemantic features to compensate for their limited capacity.
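The sparsity figures above come from TopK SAEs, where each input activation is encoded by keeping only the k largest pre-activations. A minimal numpy sketch of the encoding step (function name, shapes, and the k = 100 of 512 choice are illustrative assumptions; 100/512 ≈ 19.5% matches the LLaMA sparsity figure only coincidentally as an example):

```python
import numpy as np

def topk_sae_encode(x, W_enc, b_enc, k):
    """Encode activations with a TopK SAE: keep the k largest
    pre-activations per sample, zero the rest (hypothetical sketch)."""
    pre = x @ W_enc + b_enc                              # (batch, n_features)
    idx = np.argpartition(pre, -k, axis=-1)[:, -k:]      # top-k indices per row
    codes = np.zeros_like(pre)
    rows = np.arange(pre.shape[0])[:, None]
    codes[rows, idx] = np.maximum(pre[rows, idx], 0.0)   # ReLU on the survivors
    return codes

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 64))           # 4 samples of a 64-dim residual stream
W = rng.normal(size=(64, 512))         # 512-feature dictionary
b = np.zeros(512)
k = 100                                # at most 100/512 ≈ 19.5% active features
codes = topk_sae_encode(x, W, b, k)
```

With this construction, per-sample sparsity is bounded by k / n_features, so the reported sparsity rates are an average fraction of active features per token.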